I used the Weight Lifting Exercise (WLE) dataset. I split the provided 'training' dataset 60%/40% and trained a random forest model on the 60%, using 5-fold cross-validation, which estimated a 1.2% out-of-sample error. The model then predicted classe in the remaining 40% of the training data with 99.2% accuracy, meaning the actual out-of-sample error was 0.8%. This model also successfully predicted all 20 of the testing problems.
# caret provides createDataPartition, findCorrelation, nearZeroVar and train
library(caret)
library(randomForest)
setwd("~/Documents/Reference/MOOC/Data Science/code/machlearn")
data <- read.csv("pml-training.csv", header=TRUE, sep=',', na.strings=c("NA","#DIV/0!",""))
finaltestdata <- read.csv("pml-testing.csv", header=TRUE, sep=',', na.strings=c("NA","#DIV/0!",""))
# make reproducible
set.seed(345)
# setting up training and testing set locally; will train models on 60% of cases, and test against 40% to estimate out of sample error and see how we're doing.
inTrain <- createDataPartition(data$classe, p = 0.6)[[1]]
pml.training <- data[inTrain, ]
pml.testing <- data[-inTrain, ]
The initial dataset had 160 variables, so the first order of business was to reduce the columns to only those with the best predictive power. Examining the data showed several columns with a preponderance of 'NA' values (or #DIV/0! errors from Excel). I also eliminated columns which were not sensor-based and therefore unlikely to be predictive (notably the first seven columns, with row id, subject name, time stamp and time window information). Finally, from the remaining columns I identified and eliminated those which were highly correlated with each other.
# eliminate columns that are more than half NA
pml.training <- pml.training[, colSums(is.na(pml.training)) < nrow(pml.training) * 0.5]
# eliminate non-sensor readings (row id, subject name, timestamps, window info)
pml.training <- pml.training[, 8:ncol(pml.training)]
# remove highly correlated variables
# correlation matrix over every variable except the last (classe, our outcome)
correlationMatrix <- cor(pml.training[, -ncol(pml.training)])
# eliminate attributes with absolute pairwise correlation above 0.75
pml.training <- pml.training[, -findCorrelation(correlationMatrix, cutoff=0.75)]
# zero- or near-zero variance?
nearZeroVar(pml.training, saveMetrics= TRUE)
## freqRatio percentUnique zeroVar nzv
## yaw_belt 1.050336 14.35122283 FALSE FALSE
## gyros_belt_x 1.076155 1.08695652 FALSE FALSE
## gyros_belt_y 1.158576 0.56046196 FALSE FALSE
## gyros_belt_z 1.105014 1.39266304 FALSE FALSE
## magnet_belt_x 1.033493 2.53057065 FALSE FALSE
## magnet_belt_y 1.047739 2.30978261 FALSE FALSE
## roll_arm 52.717949 19.81148098 FALSE FALSE
## pitch_arm 70.931034 22.52038043 FALSE FALSE
## yaw_arm 33.704918 21.51834239 FALSE FALSE
## total_accel_arm 1.089354 0.55197011 FALSE FALSE
## gyros_arm_y 1.378125 3.04008152 FALSE FALSE
## gyros_arm_z 1.098726 1.97010870 FALSE FALSE
## magnet_arm_x 1.000000 11.10733696 FALSE FALSE
## magnet_arm_z 1.029851 10.59782609 FALSE FALSE
## roll_dumbbell 1.037975 87.60190217 FALSE FALSE
## pitch_dumbbell 2.632911 85.35156250 FALSE FALSE
## yaw_dumbbell 1.053333 86.98199728 FALSE FALSE
## total_accel_dumbbell 1.102217 0.34816576 FALSE FALSE
## gyros_dumbbell_x 1.067751 1.98709239 FALSE FALSE
## gyros_dumbbell_y 1.312139 2.20788043 FALSE FALSE
## gyros_dumbbell_z 1.061662 1.60495924 FALSE FALSE
## magnet_dumbbell_z 1.141509 5.58763587 FALSE FALSE
## roll_forearm 10.653153 14.75883152 FALSE FALSE
## pitch_forearm 63.891892 21.26358696 FALSE FALSE
## yaw_forearm 14.677019 13.90964674 FALSE FALSE
## total_accel_forearm 1.081925 0.56895380 FALSE FALSE
## gyros_forearm_x 1.048544 2.36922554 FALSE FALSE
## gyros_forearm_y 1.120370 6.03770380 FALSE FALSE
## gyros_forearm_z 1.104530 2.45414402 FALSE FALSE
## accel_forearm_x 1.068966 6.55570652 FALSE FALSE
## accel_forearm_z 1.010753 4.64504076 FALSE FALSE
## magnet_forearm_x 1.083333 12.00747283 FALSE FALSE
## magnet_forearm_y 1.196078 15.19191576 FALSE FALSE
## magnet_forearm_z 1.000000 13.42561141 FALSE FALSE
## classe 1.469065 0.04245924 FALSE FALSE
The resulting dataset is much smaller.
dim(pml.training)
## [1] 11776 35
I trained a random forest model with 5-fold cross-validation. Cross-validation estimates the out-of-sample error to be 1.2% (that is, the accuracy to be 98.8%).
model <- train(classe ~ ., data=pml.training, method="rf",
               trControl=trainControl(method="cv", number=5, allowParallel=TRUE),
               tuneGrid=data.frame(mtry=10))
print(model)
## Random Forest
##
## 11776 samples
## 34 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9421, 9421, 9420, 9422, 9420
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9878569 0.9846376 0.001712162 0.002166542
##
## Tuning parameter 'mtry' was held constant at a value of 10
##
I also tested this model against the held-out 40% of the original training set, to see how well it predicted those cases.
predicted <- predict(model,newdata=pml.testing)
table(pml.testing$classe,predicted)
## predicted
## A B C D E
## A 2228 1 0 1 2
## B 14 1500 4 0 0
## C 0 11 1344 13 0
## D 0 0 13 1272 1
## E 0 0 4 3 1435
prop.table(table(pml.testing$classe,predicted),1)
## predicted
## A B C D E
## A 0.9982078853 0.0004480287 0.0000000000 0.0004480287 0.0008960573
## B 0.0092226614 0.9881422925 0.0026350461 0.0000000000 0.0000000000
## C 0.0000000000 0.0080409357 0.9824561404 0.0095029240 0.0000000000
## D 0.0000000000 0.0000000000 0.0101088647 0.9891135303 0.0007776050
## E 0.0000000000 0.0000000000 0.0027739251 0.0020804438 0.9951456311
prop.table(table(pml.testing$classe,predicted),2)
## predicted
## A B C D E
## A 0.9937555754 0.0006613757 0.0000000000 0.0007757952 0.0013908206
## B 0.0062444246 0.9920634921 0.0029304029 0.0000000000 0.0000000000
## C 0.0000000000 0.0072751323 0.9846153846 0.0100853375 0.0000000000
## D 0.0000000000 0.0000000000 0.0095238095 0.9868114818 0.0006954103
## E 0.0000000000 0.0000000000 0.0029304029 0.0023273856 0.9979137691
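The overall accuracy can also be read off as a single number rather than from the confusion table (a small sketch; assumes the `predicted` and `pml.testing` objects from above):

```r
# overall accuracy: fraction of held-out cases classified correctly
accuracy <- mean(predicted == pml.testing$classe)
accuracy
# and the corresponding out-of-sample error estimate
1 - accuracy
```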
Since the accuracy is 99.2%, even better than the cross-validation estimate, this is a good result. And indeed all 20 answers check out. Success!
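For completeness, the 20 graded answers come from applying the fitted model to the separate test file (a sketch, assuming the `model` and `finaltestdata` objects defined above; because `train` used the formula interface, `predict` picks out the needed predictor columns by name):

```r
# predict classe for the 20 graded test cases in pml-testing.csv
finalpredictions <- predict(model, newdata=finaltestdata)
finalpredictions
```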
Finally, a quick multiplot of the 34 predictor columns used, to see whether any are obviously worth exploring further or disregarding. (Yes, some level of data-visualization overload, but I did put it at the end, and you can consider this a lattice / facet grid.)
library(ggplot2)
library(grid)
library(gridExtra)
plots <- list()
predictors <- colnames(pml.training[-ncol(pml.training)])
for (pred in predictors) {
  p <- ggplot(pml.training, aes_string(x='classe', y=pred)) +
    geom_boxplot() + geom_jitter(color="#56B4E9", alpha=0.05)
  plots[[pred]] <- p
}
do.call("grid.arrange", c(plots, ncol=3))